Parameter-Efficient Transformer Embeddings

Ndubuaku, Henry, Talhi, Mouad

arXiv.org Artificial Intelligence

Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on sentence textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings. Code for reproducing our results and pre-trained weights are available at https://github.com/HMUNACHI/pete.
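
As a concrete illustration of the idea, the following minimal Python sketch builds deterministic token embeddings from a Fourier expansion of normalized token IDs; the exact basis, normalization, and follow-up MLP used in the paper may differ, and all names here are illustrative.

import numpy as np

def fourier_token_embedding(token_ids, vocab_size, num_frequencies):
    # Normalize token IDs to [0, 1) and expand them in a sine/cosine basis.
    # A sketch of the general technique, not the paper's exact recipe.
    x = np.asarray(token_ids, dtype=np.float64) / vocab_size
    k = np.arange(1, num_frequencies + 1)          # frequencies 1..K
    angles = 2.0 * np.pi * np.outer(x, k)          # shape: (num_tokens, K)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (num_tokens, 2K)

# A small MLP would then map these 2K-dimensional vectors to the model
# dimension to capture higher-order interactions (omitted here).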


Byte BPE Tokenization as an Inverse String Homomorphism

Geng, Saibo, Gambhir, Sankalp, Wendler, Chris, West, Robert

arXiv.org Artificial Intelligence

Tokenization is an important preprocessing step in the training and inference of large language models (LLMs). While there has been extensive research on the expressive power of the neural architectures used in LLMs, the impact of tokenization has not been well understood. In this work, we demonstrate that tokenization, irrespective of the algorithm used, acts as an inverse homomorphism between strings and tokens. This suggests that the character space of the source language and the token space of the tokenized language are homomorphic, preserving the structural properties of the source language. Additionally, we explore the concept of proper tokenization, which refers to an unambiguous tokenization returned from the tokenizer. Our analysis reveals that the expressiveness of neural architectures in recognizing context-free languages is not affected by tokenization.
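
The homomorphism claim can be illustrated with a toy vocabulary (hypothetical, not a real BPE tokenizer): the detokenization map sends concatenation of token sequences to concatenation of their strings, so tokenization behaves as its inverse.

# phi: token sequences -> strings is a homomorphism: phi(a + b) == phi(a) + phi(b)
vocab = {0: "lo", 1: "w", 2: "er", 3: " "}

def detokenize(token_seq):
    return "".join(vocab[t] for t in token_seq)

a, b = [0, 1], [2, 3, 0, 1]
assert detokenize(a + b) == detokenize(a) + detokenize(b)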


On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding

Xu, Kevin, Sato, Issei

arXiv.org Artificial Intelligence

However, the expressive power of Looped Transformers for function approximation, and their approximation rate, remains underexplored. In this paper, we establish approximation rates of Looped Transformers by defining the concept of the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture; to address it, we incorporate scaling parameters for each loop, conditioned on timestep encoding. Experimental results demonstrate that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding architecture.
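
A minimal PyTorch sketch of the timestep-encoding idea is shown below, assuming a single shared block looped a fixed number of times, with a per-loop scale and shift derived from a learned timestep embedding; the conditioning scheme and all names are illustrative, not the paper's exact architecture.

import torch
import torch.nn as nn

class LoopedEncoderWithTimestep(nn.Module):
    def __init__(self, d_model, nhead, num_loops):
        super().__init__()
        # One shared transformer block, reused at every loop iteration.
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_loops = num_loops
        # A learned embedding per loop index ("timestep"), mapped to scale/shift.
        self.timestep_emb = nn.Embedding(num_loops, d_model)
        self.to_scale_shift = nn.Linear(d_model, 2 * d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        for t in range(self.num_loops):
            scale, shift = self.to_scale_shift(self.timestep_emb.weight[t]).chunk(2, dim=-1)
            x = self.block(x * (1 + scale) + shift)   # timestep-conditioned modulation
        return x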


OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models

Xue, Fuzhao, Zheng, Zian, Fu, Yao, Ni, Jinjie, Zheng, Zangwei, Zhou, Wangchunshu, You, Yang

arXiv.org Artificial Intelligence

To help the open-source community have a better understanding of Mixture-of-Experts (MoE) based large language models (LLMs), we train and release OpenMoE, a series of fully open-sourced and reproducible decoder-only MoE LLMs, ranging from 650M to 34B parameters and trained on up to over 1T tokens. Our investigation confirms that MoE-based LLMs can offer a more favorable cost-effectiveness trade-off than dense LLMs, highlighting their potential for future LLM development. Another important contribution of this study is an in-depth analysis of the routing mechanisms within our OpenMoE models, leading to three significant findings: Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End. We discovered that routing decisions in MoE models are predominantly based on token IDs, with minimal context relevance. The token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged. This imperfect routing can result in performance degradation, particularly in sequential tasks like multi-turn conversations, where tokens appearing later in a sequence are more likely to be dropped. Finally, we rethink our design based on the above-mentioned observations and analysis. To facilitate future MoE LLM development, we propose potential strategies for mitigating the issues we found and further improving off-the-shelf MoE LLM designs.
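
The routing findings can be illustrated with a toy top-1 router (a sketch, not the released OpenMoE code): the expert choice below is a fixed function of the token ID, mimicking Context-Independent Specialization, and a per-expert capacity limit causes tokens later in the sequence to be dropped, mimicking Drop-towards-the-End.

import numpy as np

def toy_route(token_ids, num_experts, capacity):
    load = np.zeros(num_experts, dtype=int)
    assignments = []
    for pos, tok in enumerate(token_ids):
        expert = tok % num_experts            # depends only on the token ID, not the context
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append((pos, expert))
        else:
            assignments.append((pos, None))   # expert full: later tokens get dropped
    return assignments

print(toy_route([5, 13, 5, 21, 5, 5], num_experts=4, capacity=2))
# [(0, 1), (1, 1), (2, None), (3, None), (4, None), (5, None)]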


Memory Augmented Language Models through Mixture of Word Experts

Santos, Cicero Nogueira dos, Lee-Thorp, James, Noble, Isaac, Chang, Chung-Ching, Uthus, David

arXiv.org Artificial Intelligence

Scaling up the number of parameters of language models has proven to be an effective approach to improve performance. For dense models, increasing model size proportionally increases the model's computation footprint. In this work, we seek to aggressively decouple learning capacity and FLOPs through Mixture-of-Experts (MoE) style models with large knowledge-rich vocabulary based routing functions and experts. Our proposed approach, dubbed Mixture of Word Experts (MoWE), can be seen as a memory augmented model, where a large set of word-specific experts play the role of a sparse memory. We demonstrate that MoWE performs significantly better than the T5 family of models with a similar number of FLOPs on a variety of NLP tasks. Additionally, MoWE outperforms regular MoE models on knowledge intensive tasks and has similar performance to more complex memory augmented approaches that often require custom mechanisms to search the sparse memory.
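
A heavily simplified PyTorch sketch of the word-expert-as-sparse-memory idea follows, assuming routing is a plain lookup keyed by the token ID and that each vocabulary entry owns a single memory vector rather than a full MLP expert; names and sizes are illustrative, not the paper's implementation.

import torch
import torch.nn as nn

class WordExpertMemory(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        # One "expert" slot per vocabulary entry, acting as a sparse memory.
        self.memory = nn.Embedding(vocab_size, d_model)

    def forward(self, hidden, token_ids):        # hidden: (batch, seq, d_model)
        # Only the slots of tokens actually present are read, so the extra
        # capacity adds parameters but very few FLOPs per token.
        return hidden + self.memory(token_ids)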


Position Masking for Language Models

Wagner, Andy, Mitra, Tiyasa, Iyer, Mrinal, Da Costa, Godfrey, Tremblay, Marc

arXiv.org Machine Learning

Masked language modeling (MLM) pre-training models such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. This is an effective technique which has led to good results on many NLP benchmarks. We propose to expand upon this idea by masking the positions of some tokens along with the masked input token IDs. We follow the same standard approach as BERT, masking a percentage of the token positions and then predicting their original values using an additional fully connected classifier stage. This approach has shown good performance gains (0.3% improvement) on SQuAD, along with additional improvement in convergence times. On the Graphcore IPU, the convergence of BERT Base with position masking requires only 50% of the tokens from the original BERT paper.
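
A minimal data-preparation sketch of position masking is given below, assuming positions are masked jointly with the corresponding token IDs and that ignored labels follow the common -100 convention; the mask values, masking rate, and the separate classifier heads are illustrative rather than the paper's exact setup.

import random

MASK_TOKEN_ID = 103       # e.g. BERT's [MASK] id; assumed for illustration
MASK_POSITION_ID = 0      # placeholder position id; the paper's choice may differ

def mask_tokens_and_positions(token_ids, mask_prob=0.15):
    masked_tokens, masked_positions = [], []
    token_labels, position_labels = [], []
    for pos, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            masked_tokens.append(MASK_TOKEN_ID)         # corrupt the token id
            masked_positions.append(MASK_POSITION_ID)   # corrupt its position id too
            token_labels.append(tok)                    # target for the MLM head
            position_labels.append(pos)                 # target for the extra position head
        else:
            masked_tokens.append(tok)
            masked_positions.append(pos)
            token_labels.append(-100)                   # ignored by the loss
            position_labels.append(-100)
    return masked_tokens, masked_positions, token_labels, position_labels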